Abstract: More accessible data and the rise of advanced data analysis have led to the
use of complex models in decision-making across various fields. Nevertheless,
protecting people’s privacy remains vital. Medical predictions often employ decision
trees because of their simplicity; however, these models can also be a source of
privacy violations. To address this, we apply differential privacy, a mathematical
framework that adds calibrated random noise to provide confidentiality while
maintaining accuracy. Our novel method, Dual Noise Integrated Privacy Preservation
(DNIPP), focuses on building decision forests that preserve privacy. DNIPP allocates
its privacy budget strategically, prioritizing the lower, more vulnerable nodes of
each tree, which reduces noise in the final predictions and provides a beneficial
compromise between privacy and utility. We combine multiple trees into one forest
using a selective aggregation method that considers each tree’s accuracy, and we
expedite this procedure by employing an iterative approach. Experiments on real
datasets demonstrate that DNIPP outperforms other approaches, making it a promising
way to reconcile accuracy and privacy in sensitive tasks. DNIPP thus provides a
robust structure for decision-making in delicate situations, preserving the model's
effectiveness while safeguarding personal privacy.


Introduction 

The importance of protecting personal information has
been acknowledged for quite some time. Data-driven
societies constantly expose intimate details about
individuals. Technologies such as data mining take
advantage of personal data to offer personalized
services or products in sectors such as web search
engines and healthcare (Abadi et al., 2016). Advanced
data mining techniques have the potential to improve
medical services for patients. However, knowledge
extracted from healthcare databases during mining could
unintentionally compromise patient confidentiality
(Jain et al., 2023). Although mining electronic medical
records holds promise for exploring disease
relationships and medical treatments, it also raises
concerns regarding the exposure of confidential patient
information (Abouelmehdi et al., 2018).

Privacy-preserving data mining (PPDM) addresses this
problem by employing techniques that keep the data
confidential while still providing valuable business
insights (Jain et al., 2024). Classification is one of
the basic data mining techniques and is crucial in
predictive analytics (Bu et al., 2021). Decision trees,
a popular classification model, offer excellent
accuracy but may also carry privacy risks because
building them requires counting queries over the data
(Bettini et al., 2015; Bonawitz et al., 2020).
Differential privacy is a robust framework for limiting
individual privacy leakage: it guards against breaches
by ensuring that a change to any individual record has
only a limited effect on calculations based on the data
(Jain et al., 2015). Initially introduced for
statistical database security purposes, this idea has
now become common in PPDM, including clustering,
classification, and deep learning (Feng et al., 2005;
Yavanamandha et al., 2023; Mondal et al., 2023; Kumar
et al., 2023). In recent years, differentially private
tree-based models have been built successfully
(Claerhout et al., 2005). However, most existing
approaches overlook the question of how to allocate the
privacy budget adequately, which can negatively impact
the overall performance of the model (Gupta et al.,
2020; Cui et al., 2019).

This article suggests an alternative construction for 
private trees that represents a more refined approach to 
budget allocation. 

Our contributions are as follows:

1. We create an algorithmic model for budget allocation
that allocates different budgets to nodes depending on
their position within the tree, thereby reducing
performance degradation due to improper budget
allocation.

2. We suggest a method for selective aggregation to 
enhance the generality and prediction accuracy of 
ensemble models, as well as an iterative approach to 
facilitate speedup in the process. 

3. To verify the efficiency of our classification model 
that ensures privacy preservation and individual 
protection, we perform simulation experiments on real 
datasets. 

The paper's structure is as follows: Section 2 reviews
related work; Section 3 introduces the preliminaries;
and Section 4 describes our proposed DNIPP scheme and
the system's threat model. Section 5 discusses the
construction of private decision trees, the selective
aggregation process, and the evaluation of the accuracy
and efficiency of DNIPP. Finally, Section 6 presents
the paper's conclusion.


Literature Review 

At present, various techniques, such as anonymization,
are employed to secure data (Cui et al., 2019). Data
anonymization methods are characterized by procedures
that generalize the data to safeguard privacy.
Nonetheless, they cannot defend effectively against
attacks because modeling the attacker's background
knowledge poses a challenge (Miller et al., 2009).
Differential privacy provides a strong and practical
definition of privacy protection by preventing
attackers from extracting precise individual
information from computation results (Jain et al.,
2015; Yadav and Singh, 2023). This concept has thus
gained considerable attention in the realm of
privacy-preserving data mining (PPDM) (Malin et al.,
2004).

While decision trees are renowned for their
transparency in data mining, this attribute can pose a
threat to privacy when attackers exploit it to extract
information (Yang et al., 2018). To solve this problem,
some decision tree algorithms with differential privacy
have been proposed. For instance, the SuLQ-based ID3
algorithm proposed by Jain et al. (2015) applies
differential privacy when evaluating attribute
information gain by adding Laplace noise to the query
results (Sharma et al., 2018). However, its usefulness
is limited because the added noise significantly lowers
the classification accuracy (Li et al., 2015).

To tackle these problems, DiffP-ID3 and DiffP-C4.5
were developed with an exponential mechanism for
selecting splitting attributes to maximize
classification accuracy while protecting individuals’
privacy (Tayefi et al., 2017). In addition, some
approaches use ensemble methods such as random forests,
which can help reduce the negative impact of noise on
the model’s behavior. Friedman and Schuster devised an
efficient way to construct a differentially private ID3
classifier and showed its efficacy across datasets of
different sizes. Alternatively, a few authors proposed
a differentially private random forest algorithm that
randomly selects split attributes at internal nodes.
Feng et al. (2005) devised a differentially private
ensemble method to enhance model accuracy by reducing
privacy requirements. Certain methodologies concentrate
on reducing the randomness inherent in the exponential
mechanism. Fletcher and Islam proposed an alternative
by advocating the use of local sensitivity, as opposed
to global sensitivity, in calculating the score
function's sensitivity (Yin et al., 2018). Furthermore,
they recommended the creation of a random forest with
soft sensitivity.

Despite these advancements, the majority of current
algorithms fail to account for noise tolerance at
varying depths within trees. An adaptive budget
allocation method has been introduced that continuously
allocates privacy budgets for queries and provides
consistent accuracy results. However, this approach
causes additional spending on privacy parameter
calculations, and optimizing the allocation for each
query is still an unsolved problem (Zhu et al., 2020).
The main goal of this paper is to bridge this gap by
developing a well-tuned strategy for allocating privacy
budgets so that they are used more effectively.


Preliminaries

This section first reviews the foundational concept of
differential privacy and the mechanisms commonly used
to achieve it (Zhang et al., 2020). Then, we discuss
the Gini Index, which is one of the important metrics
used in selecting optimal split attributes during tree
construction (Zheng et al., 2017).

Differential Privacy 

The differential privacy technique ensures that adding 
or removing any record from a dataset has a negligible 
effect on computation outcomes. Consequently, it 
prevents the extraction of precise individual information 
from the results. 

Definition 1: Differential Privacy

Differential privacy concerns a randomized computation
$F$, where $Range(F)$ encompasses all possible
outcomes. Given adjacent datasets $D_1$ and $D_2$
differing by one record ($|D_1 \Delta D_2| = 1$), if
algorithm $F$ satisfies:

$$\Pr[F(D_1) \in S] \le e^{\varepsilon} \cdot \Pr[F(D_2) \in S]$$

for any subset $S$ of $Range(F)$, $F$ is said to uphold
$\varepsilon$-differential privacy. Here, $\varepsilon$
denotes the privacy budget, inversely proportional to
the level of privacy protection.


Definition 2: Sensitivity

The sensitivity of a function $f: D \to \mathbb{R}^d$,
operating on an arbitrary domain $D$ and producing a
$d$-dimensional real number vector, is defined as:

$$\Delta f = \max_{D_1, D_2 \,:\, |D_1 \Delta D_2| = 1} \lVert f(D_1) - f(D_2) \rVert_1$$

Usually, to achieve $\varepsilon$-differential privacy
for numerical queries, noise drawn from a calibrated
Laplace distribution is added to the query results.

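Definition 3: Laplace Mechanism

For a function $f: D \to \mathbb{R}^d$, where $D$ is an
arbitrary domain, the Laplace mechanism ensures
$\varepsilon$-differential privacy and is defined as:

$$F(D) = f(D) + \mathrm{Laplace}\left(\frac{\Delta f}{\varepsilon}\right)$$

where $\mathrm{Laplace}(b)$ denotes zero-mean Laplace
noise with scale $b$. However, for non-numerical
queries, the exponential mechanism is employed to
maintain $\varepsilon$-differential privacy.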
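
As an illustration of Definition 3, the following minimal sketch answers a counting query with Laplace noise. It is not the authors' implementation: the function names `laplace_noise` and `private_count`, the query, and the value of epsilon are assumptions chosen for the example. It relies on the fact that a count changes by at most 1 when one record is added or removed, so the sensitivity is 1.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-mean Laplace distribution.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Answer a counting query with noise of scale sensitivity / epsilon."""
    sensitivity = 1.0  # one record changes a count by at most 1
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # fixed seed only to make the demo repeatable
ages = [59, 60, 47, 50, 54]  # Age values of the rows shown in Table 1
noisy = private_count(ages, lambda a: a >= 50, epsilon=1.0, rng=rng)
print(f"true count = 4, noisy count = {noisy:.2f}")
```

A smaller epsilon gives a larger noise scale and therefore stronger privacy at the cost of a less accurate count.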

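Definition 4: Gaussian Mechanism

For every domain $D$, given an arbitrary function
$f: D \to \mathbb{R}^d$, the mechanism $F$ adds
Gaussian noise to $f$ to offer differential privacy (in
its standard form, the Gaussian mechanism satisfies the
relaxed $(\varepsilon, \delta)$ variant). The Gaussian
noise has the following density:

$$P(y) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(y - \mu)^2}{2\sigma^2}}$$

where $\mu$ is the mean and $\sigma$ is the standard
deviation of the noise.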


Definition 5: The Exponential Mechanism

Suppose we have a random mechanism $M$ with dataset $D$
as input and an entity object $r \in Range$ as output,
and a score function $q(D, r)$ assigning scores to each
output, with $\Delta q$ representing its sensitivity.
The mechanism $M$ maintains $\varepsilon$-differential
privacy if:

$$M(D, q) = \left\{ \text{return } r \text{ with probability} \propto \exp\left( \frac{\varepsilon \, q(D, r)}{2 \Delta q} \right) \right\}$$

We have reformulated this expression using different
mathematical symbols to provide a fresh and effective
perspective.
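
To make Definition 5 concrete, the sketch below turns a set of candidate scores into selection probabilities and samples one candidate. The candidate names, scores, sensitivity value, and function names are hypothetical and serve only as an illustration of the weighting rule.

```python
import math
import random


def exponential_mechanism_probs(scores, epsilon, sensitivity):
    """Probabilities proportional to exp(epsilon * q / (2 * sensitivity))."""
    # Subtracting the maximum score avoids overflow and does not
    # change the normalized probabilities.
    top = max(scores.values())
    weights = {
        r: math.exp(epsilon * (q - top) / (2.0 * sensitivity))
        for r, q in scores.items()
    }
    total = sum(weights.values())
    return {r: w / total for r, w in weights.items()}


def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Sample one candidate according to the probabilities above."""
    probs = exponential_mechanism_probs(scores, epsilon, sensitivity)
    candidates = list(probs)
    weights = [probs[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Hypothetical quality scores for three candidate split attributes
# (higher is better); the numbers are illustrative only.
scores = {"cp": 0.30, "thal": 0.25, "fbs": 0.02}
probs = exponential_mechanism_probs(scores, epsilon=1.0, sensitivity=0.5)
chosen = exponential_mechanism(scores, 1.0, 0.5, random.Random(0))
print(probs, chosen)
```

Candidates with higher scores are more likely to be chosen, but every candidate keeps a nonzero probability, which is what hides the influence of any single record.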

Gini Index 

When building a decision tree, the most decisive
consideration is the criterion for selecting the best
split attribute. CART uses the Gini Index as its
criterion. This index, which measures the "purity" of
samples, takes the following form:

$$\mathrm{Gini}(D) = 1 - \sum_{j=1}^{n} p_j^2$$

The symbol $p_j$ represents the proportion of samples
in the sample set that belong to the $j$-th class. For
attribute $a$, the Gini Index can be defined as
follows:

$$G_{D,C}(a) = \sum_{v=1}^{V} \frac{|D_{a=v}|}{|D|} \cdot \mathrm{Gini}(D_{a=v}, C)$$

where $|D_{a=v}|$ is the size of the subset of samples
whose attribute $a$ equals $v$, and $|D|$ is the number
of all instances. We calculate this subset's Gini index
using the formula $\mathrm{Gini}(D_{a=v}, C)$.

Thus, when building decision trees, one should select
the candidate attribute that minimizes the Gini index
after the split. This approach guides attribute
selection throughout the tree construction process.
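
The two Gini formulas can be checked numerically with a short sketch. The function names and the class counts used below are illustrative assumptions; each node is represented by its per-class sample counts.

```python
def gini(counts):
    """Gini index of a node from its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_split(children):
    """Weighted Gini index of a split.

    `children` holds the per-class counts of each child subset,
    and each child is weighted by its share of the samples.
    """
    total = sum(sum(c) for c in children)
    return sum(sum(c) / total * gini(c) for c in children)


# A node with 6 samples of one class and 3 of the other.
print(round(gini([6, 3]), 3))
# The same 9 samples split into children with counts [1, 3] and [5, 0].
print(round(gini_split([[1, 3], [5, 0]]), 3))
```

The split is worth making when the weighted index of the children is lower than the index of the parent node.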

Information Entropy 

In the data analysis domain, entropy becomes an 
important measure to know how uncertain our data is. It 
measures, essentially, how much surprise or randomness 
exists in a dataset. The more we know about a dataset, the 
lower its entropy. Greater dataset uncertainty or 
unpredictability increases entropy. Mathematically, 
entropy can be expressed using the following formula: 


$$E_p = \sum_{j=1}^{n} -p_j \log_2 p_j$$


Information Gain (IG) 

Information Gain (IG) is a pivotal factor in the
development of decision trees. It measures the
reduction in entropy, a measure of uncertainty, that a
split achieves. The process of computing information
gain is recursive, continuing until the leaf nodes of
the decision tree reach an entropy value of 0,
indicating no further splitting is necessary.
Information gain is computed for each decision tree
node as:

$$IG = E_p - \sum_{i} \frac{m_i}{n} \cdot (E_c)_i$$

Where:

$E_p$ is the entropy of the original (parent) dataset

$m_i$ is the number of instances in the $i$-th child
dataset

$n$ is the total number of instances in the parent
dataset

$(E_c)_i$ denotes the entropy of the $i$-th child dataset
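
The entropy and information gain formulas above can be sketched as follows. The function names and the example counts are assumptions made for illustration; nodes are again described by per-class sample counts.

```python
import math


def entropy(counts):
    """Entropy (base 2) of a node from its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Classes with zero samples contribute nothing to the sum.
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# A balanced parent split into two pure children removes all uncertainty.
print(information_gain([2, 2], [[2, 0], [0, 2]]))
# A split that leaves both children balanced gains nothing.
print(information_gain([2, 2], [[1, 1], [1, 1]]))
```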

[Figure 1 flowchart, in order: Input (original data set, i.e., DB) -> Calculate noise vectors -> Generation of anonymized data set -> Sampling of data set (i.e., DB1, DB2, DB3) -> Apply improved version of Random Forest -> Computation of accuracy, precision, and recall of the resultant data set using the ensemble method -> Comparison]

Figure 1. Flowchart of proposed work.


A split can be evaluated using either Gini impurity or
entropy; we adopt the former, which usually produces
more accurate results, in the new scheme introduced in
this work, called DNIPP.


Proposed Work 

The DNIPP scheme proposed in this work helps to build
decision trees with strong utility preservation and
privacy preservation. It prevents malicious analysts
from extracting individual privacy information from the
datasets. The core idea of the DNIPP scheme is to
selectively aggregate disjoint subsets into a forest.
This strategy mitigates the potential performance
degradation that a single private tree might encounter
due to the additional randomness introduced for privacy
protection. During tree construction, data miners
continuously submit queries along with privacy budgets.
Once the privacy budget is exhausted, however, no
further queries can be answered. Moreover, leaf nodes
and internal nodes have differing levels of tolerance
to noise. Therefore, we propose a new budget allocation
strategy that assigns a larger privacy budget to nodes
at deeper levels, partially mitigating the problem of
excessive noise at leaf nodes. The flowchart of the
proposed model is shown in Figure 1.
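
One way to picture a depth-dependent budget allocation is the sketch below. The geometric weighting and the `growth` parameter are hypothetical choices for illustration, not the allocation rule of DNIPP. The sketch assumes that nodes at the same depth hold disjoint records, so each depth level can reuse its share, while the shares along a root-to-leaf path add up to the total budget.

```python
def allocate_budget(total_epsilon, max_depth, growth=2.0):
    """Split total_epsilon across depths 0..max_depth.

    Depth d receives a share proportional to growth**d, so deeper
    levels (closer to the leaves) get a larger budget and therefore
    less noise. `growth` is a hypothetical tuning knob.
    """
    weights = [growth ** d for d in range(max_depth + 1)]
    total_weight = sum(weights)
    return [total_epsilon * w / total_weight for w in weights]


budgets = allocate_budget(total_epsilon=1.0, max_depth=3)
for depth, eps in enumerate(budgets):
    # Laplace noise on a count query has scale 1 / eps.
    print(f"depth {depth}: epsilon = {eps:.4f}, noise scale = {1 / eps:.2f}")
```

With `growth` set to 1.0 the sketch reduces to a uniform split, which gives a simple baseline for comparison.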


Methodology 

As a running example, we take the Heart Disease
Dataset, which records heart disease characteristics
for 1024 patients. It contains 14 features, including
both numerical and categorical attributes. The DNIPP
scheme proposed in this work enables the creation of
decision trees with strong utility preservation and
privacy preservation. It prevents malicious analysts
from extracting individual privacy information from the
datasets. The main idea behind DNIPP lies in
selectively aggregating disjoint subsets into a forest,
which mitigates the potential performance degradation
that the extra randomness introduced for privacy
protection could cause in a single private tree. This
work presents a novel technique for building highly
accurate decision forests while ensuring data privacy.
Our approach prioritizes leaf nodes, which are
particularly vulnerable to the noise introduced for
privacy protection.
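
The selective aggregation idea can be sketched as follows. The selection threshold, the accuracy weighting, and all names and numbers are hypothetical, since the exact rule is not given here; in a private setting, the per-tree accuracies would themselves have to be obtained without additional leakage.

```python
def select_trees(val_accuracies, min_accuracy):
    """Indices of trees whose validation accuracy reaches the threshold."""
    return [i for i, acc in enumerate(val_accuracies) if acc >= min_accuracy]


def weighted_vote(predictions, val_accuracies, selected):
    """Accuracy-weighted vote over the selected trees for one instance."""
    if not selected:
        raise ValueError("no tree passed the selection threshold")
    totals = {}
    for i in selected:
        label = predictions[i]
        totals[label] = totals.get(label, 0.0) + val_accuracies[i]
    return max(totals, key=totals.get)


val_accuracies = [0.82, 0.55, 0.78]  # hypothetical per-tree accuracies
predictions = [1, 0, 1]              # hypothetical votes for one patient
selected = select_trees(val_accuracies, min_accuracy=0.6)
print(selected, weighted_vote(predictions, val_accuracies, selected))
```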

Dataset

The dataset considered here is the Heart Disease
Dataset. It contains information about 1024


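Table 1. Original Heart Disease Dataset, i.e., DB.

| Row | Age | Sex | Cp | Trestbps | Chol | Fbs | Restecg | Thalach | Exang | Oldpeak | Slope | Ca | Thal | target |
|------|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| 1024 | 54 | 1 | 0 | 120 | 188 | 0 | 1 | 113 | 0 | 1.4 | 1 | 1 | 3 | 0 |
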
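
Table 2. Noisy Dataset, i.e., DB.

| Row | Age | Sex | Cp | Trestbps | Chol | Fbs | Restecg | Thalach | Exang | Oldpeak | Slope | Ca | Thal | target | noise |
|------|-------|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|--------|
| 1020 | 53.19 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 | -5.80 |
| 1021 | 60.65 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 | 0.65 |
| 1022 | 46.95 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 | -0.043 |
| 1023 | 51.71 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 | 1.711 |
| 1024 | 55.35 | 1 | 0 | 120 | 188 | 0 | 1 | 113 | 0 | 1.4 | 1 | 1 | 3 | 0 | 1.35 |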


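Table 3. After Row Sampling First Sample of Dataset, i.e., DB1.

| Row | Age | Sex | Cp | Trestbps | Chol | Fbs | Restecg | Thalach | Exang | Oldpeak | Slope | Ca | Thal | target |
|-----|-------|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 276 | 56.41 | 1 | 0 | 132 | 207 | 0 | 1 | 168 | 1 | 0.0 | 2 | 0 | 3 | 1 |
| 784 | 54.32 | 1 | 2 | 150 | 232 | 0 | 0 | 165 | 0 | 1.6 | 2 | 0 | 3 | 1 |
| 856 | 63.51 | 0 | 2 | 120 | 211 | 0 | 0 | 115 | 0 | 1.5 | 1 | 0 | 2 | 1 |
| 795 | 63.40 | 1 | 1 | 128 | 208 | 1 | 0 | 140 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| 477 | 58.50 | 1 | 2 | 128 | 229 | 0 | 0 | 150 | 0 | 0.4 | 1 | 1 | 3 | 0 |
| 796 | 38.28 | 1 | 1 | 135 | 203 | 0 | 1 | 132 | 0 | 0.0 | 1 | 0 | 1 | 1 |
| 893 | 54.33 | 1 | 0 | 128 | 204 | 1 | 1 | 156 | 1 | 1.0 | 1 | 0 | 0 | 0 |
| 828 | 43.57 | 1 | 2 | 130 | 233 | 0 | 1 | 179 | 1 | 0.4 | 2 | 0 | 2 | 1 |
| 179 | 57.88 | 0 | 0 | 134 | 409 | 0 | 0 | 150 | 1 | 1.9 | 1 | 2 | 3 | 0 |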


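Table 4. After Row Sampling Second Sample of Dataset, i.e., DB2.

| Row | Age | Sex | Cp | Trestbps | Chol | Fbs | Restecg | Thalach | Exang | Oldpeak | Slope | Ca | Thal | target |
|-----|-------|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 231 | 56.33 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 241 | 66.69 | 1 | 2 | 152 | 212 | 0 | 0 | 150 | 0 | 0.8 | 1 | 0 | 3 | 0 |
| 742 | 63.99 | 1 | 0 | 130 | 330 | 1 | 0 | 132 | 1 | 1.8 | 2 | 3 | 3 | 0 |
| 179 | 57.88 | 0 | 0 | 134 | 409 | 0 | 0 | 150 | 1 | 1.9 | 1 | 2 | 3 | 0 |
| 170 | 47.78 | 1 | 0 | 150 | 247 | 0 | 1 | 171 | 0 | 1.5 | 2 | 0 | 2 | 1 |
| 476 | 57.10 | 1 | 0 | 165 | 289 | 1 | 0 | 124 | 0 | 1.0 | 1 | 3 | 3 | 0 |
| 839 | 45.73 | 1 | 0 | 140 | 261 | 0 | 0 | 186 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 819 | 59.72 | 0 | 0 | 170 | 225 | 1 | 0 | 146 | 1 | 2.8 | 1 | 2 | 1 | 0 |
| 366 | 62.27 | 1 | 2 | 112 | 230 | 0 | 0 | 165 | 0 | 2.5 | 1 | 1 | 3 | 0 |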


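Table 5. After Row Sampling Third Sample of Dataset, i.e., DB3.

| Row | Age | Sex | Cp | Trestbps | Chol | Fbs | Restecg | Thalach | Exang | Oldpeak | Slope | Ca | Thal | target |
|-----|-------|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 319 | 53.37 | 0 | 2 | 128 | 216 | 0 | 0 | 15 | 0 | 0.0 | 2 | 0 | 0 | 1 |
| 702 | 70.46 | 0 | 1 | 160 | 302 | 0 | 1 | 162 | 0 | 0.4 | 2 | 2 | 2 | 1 |
| 296 | 64.72 | 1 | 0 | 120 | 237 | 0 | 1 | 71 | 0 | 1.0 | 1 | 0 | 2 | 0 |
| 262 | 48.68 | 1 | 0 | 122 | 222 | 0 | 0 | 186 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| 867 | 49.97 | 1 | 1 | 110 | 235 | 0 | 1 | 153 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| 687 | 55.24 | 1 | 0 | 125 | 300 | 0 | 0 | 171 | 0 | 0.0 | 2 | 2 | 3 | 0 |
| 419 | 61.82 | 0 | 2 | 160 | 360 | 0 | 0 | 151 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 545 | 47.57 | 1 | 1 | 110 | 229 | 0 | 1 | 168 | 0 | 1.0 | 0 | 0 | 3 | 0 |
| 367 | 46.17 | 1 | 1 | 110 | 229 | 0 | 1 | 168 | 0 | 1.0 | 0 | 0 | 3 | 0 |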


patients and their attributes related to heart disease.
It has 14 features, in both numerical and categorical
form. Numerical variables include Age, Trestbps
(resting blood pressure), Chol (serum cholesterol),
Thalach (maximum heart rate), and Oldpeak (ST
depression induced by exercise). Categorical attributes
include Sex, Cp (chest pain type), Fbs (fasting blood
sugar), Restecg (resting electrocardiographic results),
Exang (exercise-induced angina), Slope (slope of peak
exercise ST segment), Ca (number of major vessels
colored by fluoroscopy), Thal (thallium stress test
result), and target (presence or absence of heart
disease). The heart disease dataset is shown in Table 1.

[Figure 2: decision tree diagram. Root split x[0] <= 52.179 (gini = 0.444, samples = 9, value = [6, 3]); internal splits x[3] <= 215.0 (gini = 0.375, samples = 4, value = [1, 3]) and x[1] <= 0.5 (gini = 0.5, samples = 2, value = [1, 1]); pure leaves (gini = 0.0) with value = [5, 0], [0, 2], [0, 1], and [1, 0].]

Figure 2. Decision Tree corresponding to row sampled data DB1.


[Figure 3: decision tree diagram. Root split x[0] <= 50.988 (gini = 0.444, samples = 9, value = [6, 3]); internal split x[0] <= 60.926 (gini = 0.245, samples = 7, value = [6, 1]); pure leaves (gini = 0.0) with value = [0, 2], [6, 0], and [0, 1].]

Figure 3. Decision Tree corresponding to row sampled data DB2.


[Figure 4: decision tree diagram. Root split x[3] <= 175.0 (gini = 0.444, samples = 9, value = [6, 3]); internal splits x[0] <= 40.628 (gini = 0.375, samples = 8, value = [6, 2]), x[2] <= 1.5 (gini = 0.245, samples = 7, value = [6, 1]), and x[2] <= 0.5 (gini = 0.444, samples = 3, value = [2, 1]); pure leaves (gini = 0.0) with value = [0, 1], [0, 1], [4, 0], [2, 0], and [0, 1].]

Figure 4. Tree corresponding to row sampled data DB3.


Modified Dataset

A sensitive feature such as “age” is identified that
has to be anonymized by dual noise integration, as
shown in Table 2.

Three types of sampling have been considered for the
noisy data in the experimental analysis: row-wise
sampling, column-wise sampling, and combined sampling.
Three samples have been generated for each type of
sampling. The purpose of generating samples of the
noisy data set is to feed each sample into a decision
tree classifier. A decision tree is generated for each
sample of the data set. In this technique, the final
prediction is based on the aggregation of the
predictions generated by all the decision trees.
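
The steps described here (perturb the sensitive column, draw row samples, aggregate per-tree predictions) can be sketched as below. This is a simplified stand-in under stated assumptions: a single Gaussian draw per record replaces the dual noise integration, whose exact form is not specified in this passage; rows are sampled without repetition within a sample; and the per-tree predictions are placeholder values rather than the output of trained trees. All function names are hypothetical.

```python
import random
from collections import Counter


def add_noise(records, column, sigma, rng):
    """Return copies of the records with Gaussian noise added to one column."""
    noisy = []
    for rec in records:
        rec = dict(rec)  # copy so the original data set stays untouched
        rec[column] += rng.gauss(0.0, sigma)
        noisy.append(rec)
    return noisy


def row_sample(records, size, rng):
    """Draw `size` distinct rows at random (row-wise sampling)."""
    return rng.sample(records, size)


def majority_vote(predictions):
    """Aggregate the per-tree class predictions into one final label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)  # fixed seed only for repeatability
# Age and target of the five rows shown in Table 1.
db = [
    {"age": 59, "target": 1},
    {"age": 60, "target": 0},
    {"age": 47, "target": 0},
    {"age": 50, "target": 1},
    {"age": 54, "target": 0},
]
noisy_db = add_noise(db, "age", sigma=2.0, rng=rng)
samples = [row_sample(noisy_db, 3, rng) for _ in range(3)]  # DB1, DB2, DB3
# One prediction per tree for a single patient (placeholder values).
print(majority_vote([1, 0, 1]))
```

In a full pipeline, one decision tree would be trained on each sample and its prediction passed to the aggregation step.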

Results & Discussions
Experimental Evaluation and Analysis

This section provides a detailed evaluation of the
performance and effectiveness of the proposed algorithm
(DNIPP). We assess the algorithm using several metrics,
such as precision, recall, accuracy, and F1-score. This
analysis gives insight into the algorithm's strengths
and some limitations it may have. Moreover,
experimental studies are performed to show fast
computation of important parameters, including Gaussian
noise, information entropy, Gini impurity, information
gain, and hyperparameter tuning. These experiments
demonstrate the scalability of the algorithm’s
computations as well as its computational efficiency.

In random forest classification, features are assessed
by their importance; the resulting feature importance
vector is given in Table 7.

Table 7. Feature Importance Value.

| Name of Feature | Value representing feature importance |
|-----------------|---------------------------------------|
| age | 0.08777226 |
| Sex | 0.03875317 |
| CP | 0.1577776 |
| Trestbps | 0.06804852 |
| Chol | 0.05536884 |
| Fbs | 0.00838654 |
| Restecg | 0.01418357 |
| Thalach | 0.08071837 |
| Exang | 0.06114553 |
| Oldpeak | 0.12830501 |
| Slope | 0.05535863 |
| Ca | 0.05535863 |
| Thal | 0.12655215 |
| target | 0.12562981 |

[Figure 5: horizontal bar chart titled "Feature Importances"; x-axis: Relative Importance (0.00 to 0.16); features listed top to bottom: cp, oldpeak, ca, thal, age, thalach, exang, trestbps, chol, slope, sex, restecg, fbs.]

Figure 5. Feature Importance Graph.

Evaluation Metrics and Criteria

Accuracy is a vital metric for assessing the
performance of a classification model, including our
proposed approach. It indicates the proportion of
correct predictions made by the model. For our DNIPP
algorithm, the accuracy score is 0.961089, indicating
that it correctly classifies 96.11% of the instances.

This surpasses the accuracy achieved by the baseline
BDPT method, which stands at 0.78. This improvement
suggests that our privacy-preserving mechanisms
maintain high accuracy while protecting sensitive data.

Figure 6 visually compares the accuracy scores
achieved by different privacy-preserving models.


[Figure 6: bar chart titled "Accuracy Comparison"; y-axis: Accuracy %; x-axis: Privacy Preservation Techniques; legend: DiffP-C4.5, Adiffp, BDPT, DNIPP; legible bar labels: 70, 74, 78.]

Figure 6. Comparison of accuracy between proposed
and existing systems.


Various performance metrics, such as F-1 score, 
recall, support, and precision, have been computed to 
evaluate the proposed technique DNIPP. Figure 7 shows 
the comparison among various performance metrics. 
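
For reference, the metrics named above can be computed from true and predicted labels as in the following sketch. The label vectors are made-up values for illustration and are not results from the experiments; the function name is hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, F1-score, and support for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "support": tp + fn,  # number of true instances of the class
    }


# Made-up labels for eight patients, for illustration only.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Support is a count rather than a ratio, which is why it sits on a different scale from the other three metrics when they are plotted together.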


[Figure 7: bar chart titled "Comparison of Performance Metrics"; x-axis: Performance Metrics (Precision, Recall, F-1 Score, Support); y-axis ticks from 0 to 150; legend entries: 0 and 1.]

Figure 7. Comparison of performance metrics.


Conclusion & Future Work 

This work introduces a new method of constructing
decision trees with data privacy. To ensure
confidentiality, our work focuses on leaf nodes, which
are most affected by noise. As trees grow deeper, fewer
samples are available per node, so the counts at leaf
nodes become increasingly susceptible to noise
distortion. Hence, we propose a selective noise
integration strategy that adds little noise to the
leaves while balancing the trade-off between personal
data protection and accuracy. In addition, our
selective aggregation technique allows us to choose the
trees that contribute most positively to the overall
performance of the forest. This ensures that, despite
preserving privacy, the aggregated forest remains
highly accurate. Experimental results indicate that,
compared with previous methods, this approach achieves
an excellent balance between privacy and utility.